statfandomcom-20200213-history
Matrix factorization
Notation Below: * The letters: ** u denotes a user ** i denotes an item ** j also denotes an item (if there are two items, they are denoted by i and j) ** f denotes a feature number ** g() denotes an arbitrary function ** r denotes rating (or anything else which a recommender try to forecast, i. e. probability to select an item) ** p and q denotes user and item feature vector, correspondently (see explanation in text) * r_{ui} is the rating given by user u to item i * \hat{r}_{ui} is the predicted rating given by user u to item i * R(u) is the set of all items rated by the user u, while the rating is known. * N(u) is the set of all items rated by the user u, while the rating may be known or unknown. * R(i) is the set of all users who rated the item i, while the rating is known. * N(i) is the set of all users who rated the item i, while the rating may be known or unknown. SVD Each user u receives a feature vector p_u\in \mathbb{R}^n . Each item i also receives a feature vector q_i\in \mathbb{R}^n . The rating prediction is: : \hat{r}_{ui} = p_u^{T}q_i Sometimes, user and item biases are used explicitly: : \hat{r}_{ui} = p_u^{T}q_i + b_u + b_i + \mu Here, b's are additional parameters of the model called biases. Feature vectors (and also b 's and \mu , if relevant) are parameters of the model. Learning the model means finding feature vectors. Standard training Algorithm Objective: : Minimize with respect to Q,P,b \sum_{(u,i):i\in R(u)} ( r_{ui} - \mu - b_u -b_i - p_u^{T}q_i ) ^ 2 + \lambda ( b_u^2 +b_i^2 + ||q_i||^2 + ||p_u ||^2 ) The first term is the error, while the second term is the relaxation term. Gradient descent For each pair (u, i) from the training set do: : Calculate residual: d_{ui} = r_{ui} - p_u q_i\! : \Delta q_i = \eta \left(d_{ui} p_u -\lambda q_i\right)\! : \Delta p_u = \eta \left(d_{ui} q_i -\lambda p_u\right)\! : q_i \leftarrow q_i + \Delta q_i \! : p_u \leftarrow p_u + \Delta p_u \! We will refer to these as standard formulas. Sweeping through the training set and updating the values with standard formulas is probably the simplest of all SVD algorithms. Alternating least squares Note that for each user u, minimizing the expression : \sum_{i\in R(u)} (r_{ui}-p_u q_i)^2 constitutes a linear regression problem. Therefore, given item features, it is easy to find the user features. Similarly, given user features, it is easy to find the item features. This leads us to the alternating least squares algorithm, described by Bell, Koren and Volinsky in their Netflix Prize model http://www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf: Start with random p's and q's. Repeat until the error stops to improve: Given q's, find p's with ridge regression Given p's, find q's with ridge regression Bell, Koren and Volinsky obtained an additional accuracy boost by employing a non-negative solver, thus requiring that all features would be non-negative. They had also try the lasso regression, which, they claimed, was not as good as the ridge regression, but, being added to the blend, improved the total error. Another idea by Zhang and Koren, also used by Bell, Koren and Volinsky team, was to regularize user features based on rich Gaussian prior.http://users.soe.ucsc.edu/~jonathan/fp387-zhang.pdf Neighbourhood-aware SVD Below is an excerpt from the Netflix prize forum, by PragmaticTheory, one of the future winners of the contest: The neighborhood-aware factorization is a nice enhancement that is easy to implement, as least if you are using alternating least square regression. To answer the question, you train the regular SVD (4.3) first. Then, when making the predictions, you recompute the user factors for each user/item pair you want to predict, keeping the item factors fixed, but overweighting items similar to the item you want to predict. Suppose that you did SVD and want a prediction for user 10, item 3. Then you train user again with ridge regression (see previous section) on all items, but items which are close to the item 3 (meaning their feature vector q_i is close to q_3 ), receive higher weights. A good weight is \left({q_i q_j \over |q_i| |q_j|}\right)^n , where n is some big number (i.e. 20) Asymmetric SVD For asymmetric SVD (ASVD), user features are estimated directly from the ratings. There are two kinds of ASVD. First kind of ASVD is: : p_u = \sigma\left(b + \sum_{j \in R(u)} w_{r_{uj}} s_j \right) where * \sigma denotes the sigmoid function: \sigma(x)={1\over 1+e^{-x}} * r_{uj} is the rating given by user u to item j * R(u) is the set of items rated by the user u, with known ratings * w, b, S are parameters of the algorithm. Finding the optimal w, b and S is done by the gradient descent with relaxation. Sometimes in competitions, it is known that a user rated an item, but the rating is unknown. For such cases, we can apply a separate rating (0). Another kind of ASVD is: : \hat{r'}_{ui}= q_i^{T} p_u : p_u = |R(u)|^{-0.5} \sum_{j \in R(u)} (r'_{uj} s_j) where * {r'}_{ui} = r_{ij} - (\mu + b_u + b_i) * r_{uj} is the rating given by user u to item j * R(u) is the set of items rated by the user u, with known ratings * \mu , b, S, Q are parameters of the algorithm. Note that in case of infinite statistics, for the second form of AVSD, the vector p_u grows indefinitely, while in the first form of ASVD it converges to some limit. There is also a version of these algorithms where S=Q. The second form is also called NSVDD by Toscher, Jahrer and Bell http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf. NSVD NSVD1 is similar to ASVD, but user features are estimated from the items selected by the user, rather than ratings. : \hat{r'}_{ui}= q_i^{T} p_u : p_u = |N(u)|^{-0.5} \sum_{j \in N(u)} s_j where * {r'}_{ui} = r_{ij} - (\mu + b_u + b_i) * r_{uj} is the rating given by user u to item j * N(u) is the set of items rated by the user u, whether ratings are known or unknown * \mu , b, S, Q are parameters of the algorithm. Q can be initialized from the SVD output, with each column normalized to mean=0. NSVD2 is different from NSVD1 only in that S=Q. NSVD2 is not as good as NSVD1, but adds to the total blend. SVD++ SVD++ contains both NSVD and SVD terms: : \hat{r}_{ui}= \mu + b_u + b_i + q_i^{T} p_u + {q}_i^{T} {p'}_u : p^\prime_u = |N(u)|^{-0.5} \sum_{j \in N(u)} s_j where b, P, Q, P', S are learned simultaneously. N(u) can be replaced by the whole set of known facts about the user, including, but not limited to, the items he or she selected: : p_u = \sum{k \in K(u)} \alpha_{ku} s_k Here, |K(u)|^{-0.5} is one possible choice for \alpha_{ku} . For an efficient algorithm, see "Feature-based Matrix Factorization" by Tianqi Chen, Zhao Zheng, Qiuxia Lu, Weinan Zhang, and Yong Yu http://arxiv.org/abs/1109.2271 . The term SVD++ had been proposed by Yehuda Koren. Combination of all the above For a combination of all the above, as well as the neighbourhood model, see Factorization meets the neighborhood: a multifaceted collaborative filtering model, by Yehuda Koren. The model is given by the formula (16) and includes biases, SVD, ASVD, NSVD, and KNN-like terms. Non-linear SVD correction There are 3 ways to impose non-linearity. All of them can be implemented as second-order corrections to the main SVD. Calibration The formula : \hat{r}_{ui} = \sum_f p_{uf}q_{if} is unrealistic since r_{ui} is restricted. The more realistic formula is: : \hat{r}_{ui} = g\left(\sum_f p_{uf}q_{if}\right) Here, g is a smooth function, which is fitted with smoothing splines. For each observation (u, i, r_{ui}) in the validation dataset, the predicted rating \hat{r}_{ui}=\sum_f p_{uf}q_{if} is calculated. The pairs (\hat{r}_{ui}, r_{ui}) are used to calculate the function g using smoothing splines. The training of this model should NOT be performed on the same database on which the main SVD is trained. It can be trained on the validation dataset used for fitting parameters. On the right there is a g() function for the Netflix problem. Note that at sufficiently high raw predictions, the expected rating is virtually independent on the prediction. This may significantly affect the recommendations, since a typical recommender works with movies of very high predicted rating. 2d non-linearity The formula : \hat{r}_{ui} = \sum_f p_{uf}q_{if} can be generalized as : \hat{r}_{ui} = \sum_f g_f\left(p_{uf},q_{if}\right) The correction is : \Delta\hat{r}_{ui}= \sum_f g_f(p_{uf}, q_{if}) , where g''f'' is a 2d smooth function fitted with smoothing 2d splines, for each feature separately. For training this model, some fraction of data (say, 3%) can be reserved, while other 97% are used for training the main SVD. The Singular Value Decomposition should be applied to the item features matrix, so that the item features are orthogonal to each other (i. e. the feature 5 is orthogonal to the feature 6). For each feature f, the set of triplets \left(p_{uf}, q_{if}, r_{ui}-\hat{r}_{ui}\right) is obtained. Then, the smoothing splines are applied, and the residuals are adjusted. Item-based non-linearity Suppose that the items are movies, and the feature 5 is a "plot vs dialogues" feature, scaled from 0 to 1. It seems unlikely that every user either likes movies with lot of dialogues (the more the better), or with few dialogues (the fewer the better), or just indifferent to the feature. The dependence may be non-monotinic, and include one or more local maxima, and if monotonic, it is not necessarily linear. For example: * Mike likes if this feature is around 0.8 * Jill likes if this feature is either close to 0 or close to 1 * Denis likes if this feature is above 0.9, otherwise he is indifferent to it In order to formalize such considerations, we first apply the Singular Value Decomposition to the item features matrix, making the item features orthogonal to each other (i. e. the feature 5 is orthogonal to the feature 6). Then, the correction is fitted as : \Delta\hat{r}_{ui}= \sum_f g_{uf}(q_{if}) , where g''f'' is a smooth function fitted with smoothing splines, for each user and feature separately. This is performing by creating, for each (u, f) pair, the set of pairs \left(q_{if}, r_{ui}-\hat{r}_{ui}\right) , and application of smoothing splines. Unlike two previous non-linear algorithms, this one requires the lot of data and thus should be trained on the same training set as the main SVD. Time-dependent SVD It is important that each user's taste changes with time. The public opinion about an item also changes with time. Decaying weights The simplest way to take time into account is to weight past ratings according to their date: : d_{ui} = r_{ui} - p_u q_i\! : \Delta q_i = \eta \left(d_{ui} p_u -\lambda q_i\right) \exp((t_{ui}-t_0)/\tau)\! : \Delta p_u = \eta \left(d_{ui} q_i -\lambda p_u\right) \exp((t_{ui}-t_0)/\tau)\! : q_i \leftarrow q_i + \Delta q_i \! : p_u \leftarrow p_u + \Delta p_u \! where t_0 is the time of prediction, t_{ui} is the time of choosing item i by user u. Time-dependent features Another way to take time into account is : \hat r_{ui} = \sum_f g_{uf}(t_{ui}) q_{if} where g_{uf} is a smooth function fitted as a spline, for each user and feature separately. Here r means the residual. In this way, in order to find g, for each user u and feature f we minimize : \sum_i\left(r_{ui} - g_{uf}(t_{ui}) q_{if}\right)^2 This is the same as minimizing : \sum_i q_{if}^2\left( g_{uf}(t_{ui}) - r_{ui}/q_{if}\right)^2 Here, g can be fitted with smoothing splines, with q_{if}^2 being the weight.